During this short tutorial we will go through a sample workflow with the DESeq2 R package (see Anders and Huber (2010)).

1 Overview of this practical

We will only go through all the standard functions and analyses of DESeq in this practical.

DESeq2 Analysis

  • Import data
  • DESeqDataSet
  • Normalisation (size factors)
  • Differential expression
  • Result tables

2 Data used in this case study

Loss of function of myosin chaperones triggers Hsf1-mediated transcriptional response in skeletal muscle cells
Christelle Etard, Olivier Armant, Urmas Roostalu, Victor Gourain, Marco Ferg and Uwe Strähle
Genome Biology 2015 16:267 https://doi.org/10.1186/s13059-015-0825-8

RNA-seq_Strahle_Lab_0005AS.<SequencingID>.USERvgourain.R.ReadsPerGene.out.tab
Hpf wt unc45b
24 DCD001548SQ DCD001560SQ
DCD001559SQ DCD001554SQ
48 DCD001546SQ DCD001564SQ
DCD001558SQ DCD001555SQ
72 DCD001547SQ DCD001565SQ
DCD001545SQ DCD001551SQ

All the libraries were unstranded paired-ended sequenced on Illumina HiSeq 2000 producing 50bp long reads.

3 Data preparation

Generate count tables of reads per gene

example of ReadsPerGene.out.tab table:

$ head RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
N_unmapped  1100914 1100914 1100914
N_multimapping  11275176    11275176    11275176
N_noFeature 6090090 14560190    14670407
N_ambiguous 478755  112331  109422
ENSDARG00000104632  26  14  13
ENSDARG00000100660  42  19  27
ENSDARG00000098417  10  8   3
ENSDARG00000100422  2406    1201    1219
ENSDARG00000102128  250 133 118
ENSDARG00000103095  621 318 307

select columns (gene IDs, unstranded reads) and ignore the first 4 lines

$ tail +5 RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab | awk '{print $1 "\t" $2}' > RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.DESeq.tab
$ tail +5 RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab | awk '{print $1 "\t" $2}' > RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.DESeq.tab
$ tail +5 RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab | awk '{print $1 "\t" $2}' > RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.DESeq.tab
$ tail +5 RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab | awk '{print $1 "\t" $2}' >  RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.DESeq.tab
$ htseq-count [options] <alignment_files> <gff_file>

  Options:
    -s, whether the data is from a strand-specific assay 
    -r, the alignment have to be sorted either by read name or by alignment position
    -f, input format (SAM/BAM)

4 Import data into DESeq2 & create a DESeqDataSet

First you will want to specify a variable which points to the directory in which the htseq-count output files are located.

library("DESeq2")
directory <- "../data/provided_files/data_DESeq/"

We specify which files to read in using list.files, and select those files contain the extension “.tab” using grep. Then we set the sample conditions for downstream analyses. Finally, we make a dataframe containing this information.

sampleFiles <- grep("*.tab", list.files(directory), value = TRUE)
sampleCondition <- c("wt", "unc45b", "wt", "unc45b")
sampleTable <- data.frame(sampleName = sampleFiles, fileName = sampleFiles, 
    condition = sampleCondition)
# show the sampleTable
sampleTable
##                                                                   sampleName
## 1 RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## 2 RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## 3 RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## 4 RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
##                                                                     fileName
## 1 RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## 2 RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## 3 RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## 4 RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
##   condition
## 1        wt
## 2    unc45b
## 3        wt
## 4    unc45b

Then we build the DESeqDataSet using the following function as we have htseq-count files:

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, 
    design = ~condition)

5 Differential expression analysis

The standard differential expression analysis steps are wrapped into a single function, DESeq.

dds <- DESeq(ddsHTSeq)

6 Export normalised counts

normalised.table <- (counts(dds, normalized = T))
write.csv(normalised.table, "../data/results/DESeq2-wt_vs_unc45b-normalised.counts.csv")

7 Diagnostics

7.1 Size factors for normalisation

The size factors are accessible via sizeFactors:

sizeFactors(dds)
## RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab 
##                                                                  1.2696225 
## RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab 
##                                                                  0.7059657 
## RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab 
##                                                                  0.9633211 
## RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab 
##                                                                  1.1855957

7.2 Dispersion plot

For every gene, a Negative Binomial (NB) distribution is fitted based on the counts to estimate the dispersion in 3 steps:

  • estimates dispersion parameter for each gene

  • plots and fits a curve

  • adjusts the dispersion parameter towards the curve

Legend:

  • black dots: estimated from normalised data
  • red line: fitted curve
  • blue dots: final assigned dispersion parameter for that gene
plotDispEsts(dds)

8 Results

Results tables are generated using the function results(), which extracts a results table with log2 fold-changes, p-values and adjusted p-values. With no additional arguments to results, the log2 fold-change and Wald test p-value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the first level (unc45b over wt).

Details about the comparison are printed to the console, above the results table. The text, condition treated vs untreated, tells you that the estimates are of the logarithmic fold-change log2 (wt vs unc45b).

# get results
res <- results(dds)
# order for easy view:
res.ord <- res[order(res$padj), ]
# let's check the top differentially expressed genes!
head(res.ord)
## log2 fold change (MLE): condition wt vs unc45b 
## Wald test p-value: condition wt vs unc45b 
## DataFrame with 6 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat
##                    <numeric>      <numeric> <numeric> <numeric>
## ENSDARG00000037403 14439.485      -8.296490 0.1813989 -45.73616
## ENSDARG00000056210  6310.683      -7.550094 0.2408243 -31.35105
## ENSDARG00000094719  2027.783      -5.489757 0.1922252 -28.55899
## ENSDARG00000012381  8790.944      -3.916069 0.1499467 -26.11640
## ENSDARG00000055723  4247.576      -5.538385 0.2152812 -25.72629
## ENSDARG00000041065 11004.821      -5.144954 0.2059325 -24.98369
##                           pvalue          padj
##                        <numeric>     <numeric>
## ENSDARG00000037403  0.000000e+00  0.000000e+00
## ENSDARG00000056210 9.414973e-216 8.257873e-212
## ENSDARG00000094719 2.172252e-179 1.270188e-175
## ENSDARG00000012381 2.374718e-150 1.041432e-146
## ENSDARG00000055723 5.940092e-146 2.084022e-142
## ENSDARG00000041065 9.195203e-138 2.688371e-134

We can summarize some basic tallies using the summary function.

summary(res)
## 
## out of 28527 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 132, 0.46% 
## LFC < 0 (down)   : 302, 1.1% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 10985, 39% 
## (mean count < 34)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

8.1 Lowering the adjusted p-value

How many adjusted p-values were less than 0.05?

sum(res$padj < 0.05, na.rm = TRUE)
## [1] 337


The results function contains a number of arguments to customize the results table which is generated. You can read about these arguments by looking up ?results. Note that the results function automatically performs independent filtering based on the mean of normalised counts for each gene, optimizing the number of genes which will have an adjusted p-value below a given FDR cutoff, alpha. Independent filtering is further discussed below. By default the argument alpha is set to 0.1. If the adjusted p-value cutoff will be a value other than 0.1, alpha should be set to that value: here that would be 0.05.

res05 <- results(dds, alpha = 0.05)
res05 <- na.omit(res05)  #remove NA values
summary(res05)
## 
## out of 16993 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)     : 93, 0.55% 
## LFC < 0 (down)   : 249, 1.5% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 0, 0% 
## (mean count < 42)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results


Export results (FDR 0.05)

write.csv(as.data.frame(res05), file = "../data/results/DESeq2-wt_vs_unc45b-FDR05.csv")

9 Exploring results

In DESeq2, the function plotMA shows the log2 fold-changes attributable to a given variable over the mean of normalised counts for all the samples in the DESeqDataSet. Points will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.

plotMA(dds, alpha = 0.05, ylim = c(-5, 5), main = "DESeq2-MAplot: WT vs unc45b")

9.1 Some examples

You can plot single genes with the plotCounts() function. As long as you know the row number of the gene of interest. So when we want to look at the most significant differentially expressed gene:

plotCounts(dds, gene = which.min(res$padj), intgroup = "condition")

Or just your favourite gene PARP12:

plotCounts(dds, gene = which(rownames(res) == "ENSDARG00000100660"), intgroup = "condition")

10 Extracting transformed values

In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data.

Maybe the most obvious choice of transformation is the logarithm.

One makes use of the concept of variance stabilizing transformations (VST) (Tibshirani (1988); Huber et al. (2003); Anders and Huber (2010)), and the other is the regularized logarithm or rlog, which incorporates a prior on the sample differences (Love, Huber, and Anders (2014)). Both transformations produce transformed data on the log2 scale which has been normalised with respect to library size or other normalisation factors.

The assay function is used to extract the matrix of normalised values.

rld <- rlogTransformation(dds, blind = TRUE)
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

head(assay(vsd), 3)
##                    RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632                                                                   7.142512
## ENSDARG00000100660                                                                   7.489663
## ENSDARG00000098417                                                                   6.897470
##                    RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632                                                                   6.930279
## ENSDARG00000100660                                                                   7.220474
## ENSDARG00000098417                                                                   6.504708
##                    RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632                                                                   6.841847
## ENSDARG00000100660                                                                   7.466815
## ENSDARG00000098417                                                                   6.953443
##                    RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632                                                                   6.934472
## ENSDARG00000100660                                                                   7.234222
## ENSDARG00000098417                                                                   7.146909
head(assay(rld), 3)
##                    RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632                                                                   5.396667
## ENSDARG00000100660                                                                   6.260150
## ENSDARG00000098417                                                                   5.121452
##                    RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632                                                                   5.281471
## ENSDARG00000100660                                                                   6.105386
## ENSDARG00000098417                                                                   4.963018
##                    RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632                                                                   5.236849
## ENSDARG00000100660                                                                   6.244054
## ENSDARG00000098417                                                                   5.149828
##                    RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632                                                                   5.281899
## ENSDARG00000100660                                                                   6.107221
## ENSDARG00000098417                                                                   5.257630

10.1 Sample to sample distances

Another use of the transformed data is sample clustering. Here, we apply the dist function to the transpose of the transformed count matrix to get sample-to-sample distances and plot in a heatmap.

library("RColorBrewer")
library("gplots")

distsRL <- dist(t(assay(rld)))
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)

mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition, sep = " : "))
heatmap.2(mat, trace = "none", col = rev(hmcol), margin = c(13, 13), cexRow = 0.9, 
    cexCol = 0.9)

10.2 PCA plot

Related to the distance matrix is the PCA plot, which shows the samples in the 2D plane spanned by their first two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.

print(plotPCA(rld, intgroup = c("condition")))

11 Sessioninfo

print(sessionInfo())
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] gplots_3.0.1               RColorBrewer_1.1-2        
##  [3] DESeq2_1.18.1              SummarizedExperiment_1.8.0
##  [5] DelayedArray_0.4.1         matrixStats_0.52.2        
##  [7] Biobase_2.38.0             GenomicRanges_1.30.0      
##  [9] GenomeInfoDb_1.14.0        IRanges_2.12.0            
## [11] S4Vectors_0.16.0           BiocGenerics_0.24.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.14            locfit_1.5-9.1         
##  [3] lattice_0.20-35         gtools_3.5.0           
##  [5] rprojroot_1.2           digest_0.6.12          
##  [7] plyr_1.8.4              backports_1.1.1        
##  [9] acepack_1.4.1           RSQLite_2.0            
## [11] evaluate_0.10.1         ggplot2_2.2.1          
## [13] zlibbioc_1.24.0         rlang_0.1.4            
## [15] lazyeval_0.2.1          gdata_2.18.0           
## [17] data.table_1.10.4-3     annotate_1.56.1        
## [19] blob_1.1.0              rpart_4.1-11           
## [21] Matrix_1.2-12           checkmate_1.8.5        
## [23] rmarkdown_1.8           labeling_0.3           
## [25] splines_3.4.2           BiocParallel_1.12.0    
## [27] geneplotter_1.56.0      stringr_1.2.0          
## [29] foreign_0.8-69          htmlwidgets_0.9        
## [31] RCurl_1.95-4.8          bit_1.1-12             
## [33] munsell_0.4.3           compiler_3.4.2         
## [35] base64enc_0.1-3         htmltools_0.3.6        
## [37] nnet_7.3-12             tibble_1.3.4           
## [39] gridExtra_2.3           htmlTable_1.9          
## [41] GenomeInfoDbData_0.99.1 Hmisc_4.0-3            
## [43] XML_3.98-1.9            bitops_1.0-6           
## [45] grid_3.4.2              xtable_1.8-2           
## [47] gtable_0.2.0            DBI_0.7                
## [49] magrittr_1.5            formatR_1.5            
## [51] scales_0.5.0            KernSmooth_2.23-15     
## [53] stringi_1.1.6           XVector_0.18.0         
## [55] genefilter_1.60.0       latticeExtra_0.6-28    
## [57] Formula_1.2-2           tools_3.4.2            
## [59] bit64_0.9-7             survival_2.41-3        
## [61] yaml_2.1.14             AnnotationDbi_1.40.0   
## [63] colorspace_1.3-2        cluster_2.0.6          
## [65] caTools_1.17.1          memoise_1.1.0          
## [67] knitr_1.17

References

Anders, Simon, and Wolfgang Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biology 11 (10). BioMed Central: R106. doi:10.1186/gb-2010-11-10-r106.

Huber, Wolfgang, Anja von Heydebreck, Holger Sueltmann, Annemarie Poustka, and Martin Vingron. 2003. “Parameter estimation for the calibration and variance stabilization of microarray data.” Statistical Applications in Genetics and Molecular Biology 2 (1): Article3. doi:10.2202/1544-6115.1008.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8.

Tibshirani, Robert. 1988. “Estimating Transformations for Regression Via Additivity and Variance Stabilization.” Journal of the American Statistical Association 83 (402). Taylor & Francis, Ltd.American Statistical Association: 394. doi:10.2307/2288855.